Abstract

The paper advocates the use of a statistical tool dedicated to the exploration of data samples
populated by several sources of events. This new technique, called sPlot, is able to unfold the
contributions of the different sources to the distribution of a data sample in a given variable.
The sPlot tool applies in the context of a Likelihood fit which is performed on the data sample
to determine the yields of the various sources.
1 Introduction



This paper describes a new technique to explore a data sample when the latter consists of
several sources of events merged into a single sample of events. The events are assumed 
to be characterized by a set of variables which can be split into two components. The 
first component is a set of variables for which the distributions of all the sources of events 
are known: below, these variables are collectively referred to as a (unique) discriminating 
variable. The second component is a set of variables for which the distributions of some 
sources of events are either truly unknown or considered as such: below, these variables
are collectively referred to as a (unique) control variable. 

The new technique, termed sPlot, allows one to reconstruct the distributions for the control
variable, independently for each of the various sources of events, without making use of 
any a priori knowledge on this variable. The aim is thus to use the knowledge available 
for the discriminating variable to be able to infer the behavior of the individual sources 
of events with respect to the control variable. An essential assumption for the sPlot
technique to apply is that the control variable is uncorrelated with the discriminating 
variable. 

The sPlot technique is developed in the context of a data sample analyzed using a
maximum Likelihood method making use of the discriminating variable. Section 2 is
dedicated to the definition of fundamental objects necessary for the following. Section 3
presents an intermediate technique, simpler but inadequate, which is a first step towards
the sPlot technique. Section 4 is the core of the document where the sPlot formalism is
developed (Section 4.1) and its properties explained in detail (Section 4.2). Section 4.3
then gives instructions about how to implement and use sPlot. Finally, illustrations of
sPlots are provided with simulated events (Section 4.4) and an application for branching
ratios measurements (Section 4.5) is briefly described.

To provide some intuitive understanding of how and why the sPlot formalism works,
the problem of reconstructing the true distributions is reconsidered in Appendix A, in
a simpler analysis framework. An extension of the sPlot technique is presented in
Appendix B.

2 Basics and definitions 

A common method used to extract parameters from a data sample is the maximum Like- 
lihood method which is briefly reviewed in Section 2.1 since it constitutes the foundation 
of the sPlot technique. Section 2.2 discusses the need for checks of an analysis based on
the Likelihood method and introduces more precisely the goal of the sPlot technique.

2.1 Likelihood method 

One considers an extended Likelihood analysis of a data sample in which several species
of events are merged. These species represent various signal components (i.e. sources
of events in which one is interested) and background components (i.e. irrelevant sources
of events accompanying the signal components) which all together account for the data
sample. The log-Likelihood is expressed as:

$$ \mathcal{L} = \sum_{e=1}^{N} \ln\Big\{ \sum_{i=1}^{N_s} N_i f_i(y_e) \Big\} - \sum_{i=1}^{N_s} N_i , \tag{1} $$

where 

• $N$ is the total number of events in the data sample,

• $N_s$ is the number of species of events populating the data sample,

• $N_i$ is the number of events expected on the average for the $i$-th species,

• $y$ is the set of discriminating variables,

• $f_i$ is the Probability Density Function (PDF) of the discriminating variables for the
$i$-th species,

• $f_i(y_e)$ denotes the value taken by the PDF $f_i$ for event $e$, the latter being associated
with a set of values $y_e$ for the set of discriminating variables,

• $x$ is the set of control variables which, by definition, do not appear in the above
expression of $\mathcal{L}$.

The log-Likelihood $\mathcal{L}$ is a function of the $N_s$ yields $N_i$ and, possibly, of implicit free
parameters designed to tune the PDFs on the data sample. These parameters as well as
the yields $N_i$ are determined by maximizing the above log-Likelihood.
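As a minimal sketch of Eq. (1), the extended log-Likelihood can be evaluated for a list of events once the yields and the PDFs are given. The function name, the two toy PDFs and the three events below are assumptions made for illustration only; they do not come from the paper.

```python
import math

def extended_log_likelihood(yields, pdfs, events):
    """Eq. (1): sum_e ln( sum_i N_i f_i(y_e) ) - sum_i N_i."""
    total = 0.0
    for y in events:
        # Total density at y_e: yield-weighted sum of the species PDFs.
        density = sum(n * f(y) for n, f in zip(yields, pdfs))
        total += math.log(density)
    return total - sum(yields)

# Hypothetical two-species toy on y in [0, 1]: rising "signal", flat "background".
pdfs = [lambda y: 2.0 * y, lambda y: 1.0]
events = [0.1, 0.5, 0.9]
value = extended_log_likelihood([1.0, 2.0], pdfs, events)
```

In an actual analysis this function would be handed to a maximizer acting on the yields (and on any free shape parameters).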

2.2 Analysis Validation 

The crucial point for such an analysis of the data sample to be reliable is to use an
exhaustive list of sources of events combined with an accurate description of all the PDFs $f_i$.

To assess the quality of the fit, one may rely on an evaluation of the goodness of fit 
based on the actual value obtained for the maximum of $\mathcal{L}$, but this is rarely convincing
enough. A complementary quality check is to explore further the data sample by examin- 
ing the distributions of control variables. If the distributions of these control variables are 
known for at least one of the sources of events, one can compare the expected distribution 
for this source to the one extracted from the data sample. In order to do so, one must 
be able to unfold from the distribution of the whole data sample, the contribution arising 
from the source under scrutiny. 

In some instances of control variables, the PDF might even be known for all the 
sources of events. Such a control variable can be obtained for instance by removing 
one of the discriminating variables from the set y before performing again the maximum 
Likelihood fit, and considering the removed variable as a control variable x. Another 
example is provided by a discriminating variable for which the distributions are known 
for all sources of events, but which does not significantly improve the accuracy of the fit,
and is not incorporated in the set y, for the sake of simplicity. 

In an attempt to have access to the distributions of control variables, a common
method consists in applying cuts which are designed to enhance the contributions to the
data sample of particular sources of events (typically of signal species). Having enforced
this enhancement, the distribution of $x$ for the reduced data sample can be used to probe
the quality of the fit through a comparison with a Monte Carlo simulated distribution.
However, the result is frequently unsatisfactory: firstly because it can be used only if
the signal has prominent features to be distinguished from the background, and secondly
because, due to the cuts applied, a sizeable fraction of signal events can be lost, while a large
fraction of background events may remain. Therefore, the resulting data distribution
concerns a reduced subsample for which statistical fluctuations, or true anomalies, cannot
be attributed unambiguously either to the signal or to the background. For example,
one can be tempted to misinterpret an anomaly in the distribution of $x$ coming from the
signal as a harmless background fluctuation.

The aim of the sPlot formalism developed in this paper is to provide a convenient
method to unfold the overall distribution of a mixed sample of events in a control
variable $x$ into the sub-distributions of the various species which compose the sample. It
is a statistical technique which allows one to keep all signal events while getting rid of all
background events, and keeping track of the statistical uncertainties per bin.

More formally, one is interested in the true distribution (denoted in boldface $\mathbf{M}_n(x)$)
of a control variable $x$ for events of the $n$-th species, the latter being any one of the $N_s$
signal and background species. The purpose of this paper is to demonstrate that one can
reconstruct $\mathbf{M}_n(x)$ from the sole knowledge of the PDFs of the discriminating variables
$f_i$, the first step being to proceed to the maximum Likelihood fit to extract the yields $N_i$.

As an introduction, in Section 3, the case is considered where the variable x actually 
belongs to the set of y discriminating variables. That is to say that one makes the 
assumption opposite to the interesting one: x is assumed to be totally correlated with y. 
Because of this total correlation, there exists a function of the y parameters which fully 
determines the 'control' variable, $x = x(y)$. In that case, while performing the fit, an a
priori knowledge of the $x$-distributions is implicitly used, thus $x$ cannot play the role of a
control variable. Although the technique presented in the following Section is inadequate,
it provides a natural first step towards sPlot.

Section 4, dedicated to the sPlot formalism, treats the interesting case, where $x$ is
truly a control variable uncorrelated with $y$. In that case, while performing the fit, no a
priori knowledge of the $x$-distributions is used.



3 First step towards sPlot: inPlot

In this Section, one is considering a variable $x$ which can be expressed as a function of
the discriminating variables $y$ used in the fit. A fit having been performed to determine
the yields $N_i$ for all species, from the knowledge of the PDFs $f_i$ and of the values of the
$N_i$, one can define naively, for all events, the weight¹

$$ \mathcal{P}_n(y_e) = \frac{N_n f_n(y_e)}{\sum_{k=1}^{N_s} N_k f_k(y_e)} \tag{2} $$

which can be used to build the $x$-distribution $\tilde{M}_n$ defined by:

$$ N_n \tilde{M}_n(\bar{x})\,\delta x \equiv \sum_{e \subset \delta x} \mathcal{P}_n(y_e) , \tag{3} $$

where the sum $\sum_{e \subset \delta x}$ runs over the $N_{\delta x}$ events for which $x_e$ (i.e. the value taken by the
variable $x$ for event $e$) lies in the $x$-bin centered on $\bar{x}$ and of total width $\delta x$.

¹ It was pointed out to the authors that a weight similar to the naive one of Eq. (2) was introduced
long ago in [1].

In other words, $N_n \tilde{M}_n(\bar{x})\,\delta x$ is the $x$-distribution obtained by histogramming events,
using the weight of Eq. (2).
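The naive weight of Eq. (2) and the weighted histogram of Eq. (3) can be sketched as follows. The yields are assumed to come from a fit performed elsewhere, and the function names, the toy PDFs and the binning are hypothetical choices made for this illustration.

```python
def naive_weights(yields, pdfs, y):
    """Eq. (2): P_n(y) = N_n f_n(y) / sum_k N_k f_k(y), for every species n."""
    terms = [n * f(y) for n, f in zip(yields, pdfs)]
    norm = sum(terms)
    return [t / norm for t in terms]

def weighted_histogram(xs, weights, lo, hi, nbins):
    """Eq. (3): fixed-width histogram of x, filled with one weight per event."""
    hist = [0.0] * nbins
    width = (hi - lo) / nbins
    for x, w in zip(xs, weights):
        if lo <= x < hi:
            hist[int((x - lo) / width)] += w
    return hist

# Hypothetical inputs: the yields are assumed, not fitted here.
yields = [30.0, 70.0]
pdfs = [lambda y: 2.0 * y, lambda y: 2.0 * (1.0 - y)]
ys = [0.2, 0.5, 0.8]
xs = list(ys)  # x = x(y) = y: the situation considered in this Section
w_first = [naive_weights(yields, pdfs, y)[0] for y in ys]
hist = weighted_histogram(xs, w_first, 0.0, 1.0, 2)
```

By construction the naive weights of a given event are positive and add up to one over the species.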

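This procedure reproduces, on average, the true distribution $\mathbf{M}_n(x)$. In effect, on
average, one can replace the sum in Eq. (3) by the integral

$$ \Big\langle \sum_{e \subset \delta x} \Big\rangle \longrightarrow \int dy \sum_{j=1}^{N_s} N_j f_j(y)\, \delta(x(y) - \bar{x})\, \delta x . \tag{4} $$

Similarly, identifying the number of events $N_n$ as determined by the fit to be the expected
number of events, one obtains:

$$ \begin{aligned}
\langle N_n \tilde{M}_n(\bar{x}) \rangle &= \int dy \sum_{j=1}^{N_s} N_j f_j(y)\, \delta(x(y) - \bar{x})\, \mathcal{P}_n(y) \\
&= N_n \int dy\, \delta(x(y) - \bar{x})\, f_n(y) \\
&\equiv N_n \mathbf{M}_n(\bar{x}) .
\end{aligned} \tag{5} $$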

Therefore, the sum over events of the naive weight $\mathcal{P}_n$ provides a direct estimate of the
$x$-distribution of events of the $n$-th species. Plots obtained that way are referred to as
inPlots: they provide a correct means to reconstruct $\mathbf{M}_n(x)$ only insofar as the variable
considered is in the set of discriminating variables $y$. These inPlots suffer from a major
drawback: $x$ being correlated to $y$, the PDFs of $x$ enter implicitly in the definition of
the naive weight, and as a result, the $\tilde{M}_n$ distributions cannot be used easily to assess
the quality of the fit, because these distributions are biased in a way difficult to grasp,
when the PDFs $f_i(y)$ are not accurate. For example, let us consider a situation where, in
the data sample, some events from the $n$-th species show up far in the tail of the $M_n(x)$
distribution which is implicitly used in the fit. The presence of such events implies that
the true distribution $\mathbf{M}_n(x)$ must exhibit a tail which is not accounted for by $M_n(x)$.
These events would enter in the reconstructed inPlot $\tilde{M}_n$ with a very small weight, and
they would thus escape detection by the above procedure: $\tilde{M}_n$ would be close to $M_n$, the
distribution assumed for $x$. Only a mismatch in the core of the $x$-distribution can be
revealed with inPlots. Stated differently, the error bars which can be attached to each
individual bin of $\tilde{M}_n$ cannot account for the systematic bias inherent to the inPlots.

4 The sPlot technique

It was shown in the previous Section that if the 'control' variable $x$ belongs to the set $y$ of
discriminating variables, one can reconstruct the expected distribution of $x$ with inPlots.
However, the inPlots are not easy to decipher because knowledge of the $x$ distribution
enters in their construction.

In this Section is considered the more interesting case where the variable $x$ is truly
a control variable, i.e. where $x$ does not belong to $y$. More precisely, the two sets of
variables $x$ and $y$ are assumed to be uncorrelated: hence, the total PDFs $f_i(x, y)$ all
factorize into products $\mathbf{M}_i(x) f_i(y)$.



4.1 The sPlot formalism

One may still consider the above distribution $\tilde{M}_n$, but this time the naive weight is no
longer satisfactory: as shown below, Eq. (5) does not hold. This is because, when summing
over the events, the $x$-PDFs $\mathbf{M}_j(x)$ appear now on the right hand side of Eq. (4), while
they are absent in the Likelihood function. However, a simple redefinition of the weights
allows one to overcome this difficulty.

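Considering the naive weight of Eq. (2):

$$ \begin{aligned}
\langle N_n \tilde{M}_n(\bar{x}) \rangle &= \int\!\!\int dy\, dx \sum_{j=1}^{N_s} N_j \mathbf{M}_j(x) f_j(y)\, \delta(x - \bar{x})\, \mathcal{P}_n \\
&= \int dy \sum_{j=1}^{N_s} N_j \mathbf{M}_j(\bar{x}) f_j(y)\, \frac{N_n f_n(y)}{\sum_{k=1}^{N_s} N_k f_k(y)} \\
&= N_n \sum_{j=1}^{N_s} \mathbf{M}_j(\bar{x}) \left( N_j \int dy\, \frac{f_n(y) f_j(y)}{\sum_{k=1}^{N_s} N_k f_k(y)} \right)
\end{aligned} \tag{6} $$

$$ \neq N_n \mathbf{M}_n(\bar{x}) . \tag{7} $$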

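Indeed, as announced, the previous procedure does not apply. In effect, the correction
term appearing in Eq. (6)

$$ N_j \int dy\, \frac{f_n(y) f_j(y)}{\sum_{k=1}^{N_s} N_k f_k(y)} \tag{8} $$

is not identical to the Kronecker symbol $\delta_{jn}$. The inPlot distribution $N_n \tilde{M}_n$ obtained
using the naive weight is a linear combination of the true distributions $\mathbf{M}_j$. Only if the $y$
variable was totally discriminating would one recover the correct answer. In effect, for a
total discrimination, $f_{j \neq n}(y)$ vanishes if $f_n(y)$ is non zero. Thus, the product $f_n(y) f_j(y)$ is
equal to $f_n^2(y)\, \delta_{jn}$ and one gets:

$$ N_j \int dy\, \frac{f_n(y) f_j(y)}{\sum_{k=1}^{N_s} N_k f_k(y)} = \delta_{jn}\, N_n \int dy\, \frac{f_n^2(y)}{N_n f_n(y)} = \delta_{jn} . \tag{9} $$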

But this is purely academic, because, if $y$ was totally discriminating, obtaining
$\mathbf{M}_n(x)$ would be straightforward: one would just apply cuts on $y$ to obtain a pure sample
of events of the $n$-th species and plot them to get $\mathbf{M}_n(x)$.

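However, in the case of interest where $y$ is not totally discriminating, one observes
that the correction term is related to the inverse of the covariance matrix, given by the
second derivatives of $-\mathcal{L}$, which the analysis minimizes:

$$ \mathbf{V}^{-1}_{nj} = \frac{\partial^2 (-\mathcal{L})}{\partial N_n\, \partial N_j} = \sum_{e=1}^{N} \frac{f_n(y_e) f_j(y_e)}{\left( \sum_{k=1}^{N_s} N_k f_k(y_e) \right)^2} . \tag{10} $$

On average, replacing the sum over events by an integral (Eq. (4)) the variance matrix
reads:

$$ \begin{aligned}
\langle \mathbf{V}^{-1}_{nj} \rangle &= \int\!\!\int dy\, dx \sum_{l=1}^{N_s} N_l \mathbf{M}_l(x) f_l(y)\, \frac{f_n(y) f_j(y)}{\left( \sum_{k=1}^{N_s} N_k f_k(y) \right)^2} \\
&= \int dy \sum_{l=1}^{N_s} N_l f_l(y)\, \frac{f_n(y) f_j(y)}{\left( \sum_{k=1}^{N_s} N_k f_k(y) \right)^2} \\
&= \int dy\, \frac{f_n(y) f_j(y)}{\sum_{k=1}^{N_s} N_k f_k(y)} .
\end{aligned} \tag{11} $$

Therefore, Eq. (6) can be rewritten

$$ \langle \tilde{M}_n(\bar{x}) \rangle = \sum_{j=1}^{N_s} \mathbf{M}_j(\bar{x})\, N_j\, \langle \mathbf{V}^{-1}_{nj} \rangle . \tag{12} $$

Inverting this matrix equation, one recovers the distribution of interest:

$$ N_n \mathbf{M}_n(\bar{x}) = \sum_{j=1}^{N_s} \langle \mathbf{V}_{nj} \rangle\, \langle \tilde{M}_j(\bar{x}) \rangle . \tag{13} $$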

Hence, if the control variable $x$ is uncorrelated with the discriminating variable, the true
distribution of $x$ can still be reconstructed using the naive weight of Eq. (2), through
a linear combination of the inPlots. This result is better restated as follows. When $x$
does not belong to the set $y$, the appropriate weight is not given by Eq. (2), but is the
covariance-weighted quantity (hereafter called sWeight) defined by:

$$ {}_s\mathcal{P}_n(y_e) = \frac{\sum_{j=1}^{N_s} \mathbf{V}_{nj} f_j(y_e)}{\sum_{k=1}^{N_s} N_k f_k(y_e)} . \tag{14} $$

With this sWeight, the distribution of the control variable $x$ can be obtained from the
sPlot histogram:

$$ N_n\, {}_s\tilde{M}_n(\bar{x})\, \delta x \equiv \sum_{e \subset \delta x} {}_s\mathcal{P}_n(y_e) , \tag{15} $$

which reproduces, on average, the true distribution:

$$ \langle N_n\, {}_s\tilde{M}_n(x) \rangle = N_n \mathbf{M}_n(x) . \tag{16} $$



If the control variable $x$ exhibits significant correlation with the discriminating variable $y$,
the sPlots obtained with Eq. (15) cannot be compared directly with the pure distributions
of the various species. In that case, one must proceed to a Monte-Carlo simulation of the
procedure to obtain the expected distributions to which the sPlots should be compared.

The fact that the matrix $\mathbf{V}_{ij}$ enters in the definition of the sWeights is enlightening,
and, as discussed in the next Section, this confers nice properties to the sPlots. But this is
not the key point. The key point is that Eq. (6) is a matrix equation which can be inverted
using a numerical evaluation of the matrix based only on data, thanks to Eq. (10). Rather
than computing the matrix by this direct sum over the events, one can use the covariance
matrix resulting from the fit, but this option is numerically less accurate than the direct
computation².
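The direct computation described above can be sketched in a few lines: the matrix of Eq. (10) is accumulated event by event, inverted, and inserted into Eq. (14). This is only an illustration under stated assumptions: two species, toy PDFs, and yields that are simply assumed here rather than fitted. All function names are hypothetical.

```python
def inverse_covariance(yields, pdfs, ys):
    """Eq. (10): V^-1_{nj} = sum_e f_n(y_e) f_j(y_e) / (sum_k N_k f_k(y_e))^2."""
    ns = len(pdfs)
    inv = [[0.0] * ns for _ in range(ns)]
    for y in ys:
        f = [p(y) for p in pdfs]
        den = sum(n * fk for n, fk in zip(yields, f)) ** 2
        for n in range(ns):
            for j in range(ns):
                inv[n][j] += f[n] * f[j] / den
    return inv

def invert_2x2(m):
    """Closed-form inverse, sufficient for the two-species case."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def sweights(yields, pdfs, cov, y):
    """Eq. (14): sP_n(y) = sum_j V_{nj} f_j(y) / sum_k N_k f_k(y)."""
    f = [p(y) for p in pdfs]
    den = sum(n * fk for n, fk in zip(yields, f))
    return [sum(c * fj for c, fj in zip(row, f)) / den for row in cov]

# Hypothetical inputs, for illustration only.
pdfs = [lambda y: 2.0 * y, lambda y: 2.0 * (1.0 - y)]
ys = [0.1, 0.3, 0.5, 0.7, 0.9]
yields = [2.0, 3.0]  # assumed fit result
inv = inverse_covariance(yields, pdfs, ys)
cov = invert_2x2(inv)
w = sweights(yields, pdfs, cov, 0.9)
```

For more than two species the closed-form inverse would be replaced by a general matrix inversion.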



4.2 sPlot Properties

Beside satisfying, on the average, the essential asymptotic property Eq. (16), sPlots bear
properties which hold even under non-asymptotic conditions.



4.2.1 Normalization 

The distribution ${}_s\tilde{M}_n$ defined by Eq. (15) is guaranteed to be normalized to unity and the
sum over the species of the sPlots reproduces the data sample distribution of the control
variable. These two properties are not obvious because, from expression Eq. (14), neither
is it obvious that the sum over the $x$-bins of $N_n\, {}_s\tilde{M}_n\, \delta x$ is equal to $N_n$, nor is it obvious
that, in each bin, the sum over all species of the expected numbers of events equates to the
number of events actually observed. The demonstration uses the three sum rules below.

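1. Maximum Likelihood Sum Rule

The Likelihood Eq. (1) being extremal for $N_j$, one gets the first sum rule:

$$ \sum_{e=1}^{N} \frac{f_j(y_e)}{\sum_{k=1}^{N_s} N_k f_k(y_e)} = 1 , \quad \forall j . \tag{17} $$

2. Variance Matrix Sum Rule

From Eq. (10) and Eq. (17) one derives:

$$ \sum_{i=1}^{N_s} N_i \mathbf{V}^{-1}_{ij} = \sum_{i=1}^{N_s} N_i \sum_{e=1}^{N} \frac{f_i(y_e) f_j(y_e)}{\left( \sum_{k=1}^{N_s} N_k f_k(y_e) \right)^2} = \sum_{e=1}^{N} \frac{f_j(y_e)}{\sum_{k=1}^{N_s} N_k f_k(y_e)} = 1 . \tag{18} $$

3. Covariance Matrix Sum Rule

Multiplying both sides of Eq. (18) by $\mathbf{V}_{jl}$ and summing over $j$ one gets the sum rule:

$$ \sum_{j=1}^{N_s} \mathbf{V}_{jl} = \sum_{j=1}^{N_s} \mathbf{V}_{jl} \sum_{i=1}^{N_s} N_i \mathbf{V}^{-1}_{ij} = \sum_{i=1}^{N_s} \Big( \sum_{j=1}^{N_s} \mathbf{V}^{-1}_{ij} \mathbf{V}_{jl} \Big) N_i = \sum_{i=1}^{N_s} \delta_{il} N_i = N_l . \tag{19} $$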

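It follows that:

• Each $x$-distribution is properly normalized (cf. Eq. (17) and Eq. (19)):

$$ \sum_{[\delta x]} N_n\, {}_s\tilde{M}_n(x)\, \delta x = \sum_{e=1}^{N} {}_s\mathcal{P}_n(y_e) = \sum_{e=1}^{N} \frac{\sum_{j=1}^{N_s} \mathbf{V}_{nj} f_j(y_e)}{\sum_{k=1}^{N_s} N_k f_k(y_e)} = \sum_{j=1}^{N_s} \mathbf{V}_{nj} = N_n . \tag{20} $$

• The contributions ${}_s\mathcal{P}_l(y_e)$ add up to the number of events actually observed in each
$x$-bin. In effect, for any event (cf. Eq. (19)):

$$ \sum_{l=1}^{N_s} {}_s\mathcal{P}_l(y_e) = \sum_{l=1}^{N_s} \frac{\sum_{j=1}^{N_s} \mathbf{V}_{lj} f_j(y_e)}{\sum_{k=1}^{N_s} N_k f_k(y_e)} = \frac{\sum_{j=1}^{N_s} N_j f_j(y_e)}{\sum_{k=1}^{N_s} N_k f_k(y_e)} = 1 . \tag{21} $$

² Furthermore, when parameters are fitted together with the yields $N_j$, in order to get the correct
matrix, one should take care to perform a second fit, where these parameters are frozen.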

Therefore, an sPlot provides a consistent representation of how all events from the various
species are distributed in the control variable $x$. The contributions to the data sample
distribution in $x$ from the various species are disentangled according to a fit based on
the discriminating variable, provided $x$ and $y$ are uncorrelated. Summing up the $N_s$
sPlots, one recovers the data sample distribution in $x$, and summing up the number of
events entering in an sPlot for a given species, one recovers the yield of the species, as it
is provided by the fit.
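The two normalization properties can be checked numerically on a toy sample. The sketch below is an illustration under assumptions, not the procedure of any particular package: the PDFs, the sample sizes and the seed are arbitrary, and the yields are fitted with a simple fixed-point iteration whose stationary point satisfies Eq. (17) (an EM-style update, used here only because it needs no external minimizer).

```python
import math
import random

def fit_yields(pdfs, ys, n_iter=300):
    """Iterate N_j <- sum_e N_j f_j(y_e) / sum_k N_k f_k(y_e); at convergence
    the yields satisfy the sum rule of Eq. (17)."""
    yields = [len(ys) / 2.0, len(ys) / 2.0]
    for _ in range(n_iter):
        new = [0.0, 0.0]
        for y in ys:
            t0, t1 = yields[0] * pdfs[0](y), yields[1] * pdfs[1](y)
            new[0] += t0 / (t0 + t1)
            new[1] += t1 / (t0 + t1)
        yields = new
    return yields

def covariance(yields, pdfs, ys):
    """Inverse of the 2x2 matrix of Eq. (10), computed by direct sum."""
    a = b = d = 0.0
    for y in ys:
        f1, f2 = pdfs[0](y), pdfs[1](y)
        den = (yields[0] * f1 + yields[1] * f2) ** 2
        a += f1 * f1 / den
        b += f1 * f2 / den
        d += f2 * f2 / den
    det = a * d - b * b
    return [[d / det, -b / det], [-b / det, a / det]]

def sweights(yields, pdfs, cov, y):
    """Eq. (14) for both species at one value of y."""
    f1, f2 = pdfs[0](y), pdfs[1](y)
    den = yields[0] * f1 + yields[1] * f2
    return [(row[0] * f1 + row[1] * f2) / den for row in cov]

# Hypothetical toy sample: f_1(y) = 2y and f_2(y) = 2(1 - y) on [0, 1].
random.seed(1)
pdfs = [lambda y: 2.0 * y, lambda y: 2.0 * (1.0 - y)]
ys = [math.sqrt(random.random()) for _ in range(300)]
ys += [1.0 - math.sqrt(random.random()) for _ in range(700)]

yields = fit_yields(pdfs, ys)
cov = covariance(yields, pdfs, ys)
w = [sweights(yields, pdfs, cov, y) for y in ys]

sum_per_species = [sum(we[n] for we in w) for n in range(2)]  # Eq. (20)
worst_event_sum = max(abs(we[0] + we[1] - 1.0) for we in w)   # Eq. (21)
```

Both checks rely on the yields being at the Likelihood maximum: with yields taken away from the maximum, Eq. (17) fails and so do Eqs. (20) and (21).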

For instance, if one observes an excess of events for a particular $n$-th species, in a given
$x$-bin, this excess is effectively accounted for in the number of events $N_n$ resulting from
the fit. To remove these events (for whatever reason and by whatever means) implies a
corresponding decrease in $N_n$. It remains to gauge how significant an anomaly in the
$x$-distribution of the $n$-th species is. This is the subject of the next Section.

4.2.2 Statistical uncertainties 

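The statistical uncertainty on $N_n\, {}_s\tilde{M}_n(x)\, \delta x$ can be defined in each bin by

$$ \sigma[N_n\, {}_s\tilde{M}_n(x)\, \delta x] = \sqrt{\sum_{e \subset \delta x} ({}_s\mathcal{P}_n)^2} . \tag{22} $$

The proof that Eq. (22) holds asymptotically goes as follows:

$$ \Big\langle \Big( \sum_{e \subset \delta x} {}_s\mathcal{P}_n \Big)^2 \Big\rangle - \Big\langle \sum_{e \subset \delta x} {}_s\mathcal{P}_n \Big\rangle^2 = \langle N_{\delta x} \rangle \langle {}_s\mathcal{P}_n^2 \rangle + \langle N_{\delta x}(N_{\delta x} - 1) \rangle \langle {}_s\mathcal{P}_n \rangle^2 - \langle N_{\delta x} \rangle^2 \langle {}_s\mathcal{P}_n \rangle^2 = \Big\langle \sum_{e \subset \delta x} ({}_s\mathcal{P}_n)^2 \Big\rangle , \tag{23} $$

where the last step uses $\langle N_{\delta x}(N_{\delta x} - 1) \rangle = \langle N_{\delta x} \rangle^2$, valid for a Poisson-distributed number
of events $N_{\delta x}$ in the bin.

The above asymptotic property is completed by the fact that the sum in quadrature of
the uncertainties Eq. (22) reproduces the statistical uncertainty on the yield $N_n$, as it is
provided by the fit: $\sigma[N_n] \equiv \sqrt{\mathbf{V}_{nn}}$. The sum over the $x$-bins reads:

$$ \begin{aligned}
\sum_{[\delta x]} \sigma^2[N_n\, {}_s\tilde{M}_n\, \delta x] &= \sum_{e=1}^{N} ({}_s\mathcal{P}_n(y_e))^2 \\
&= \sum_{j=1}^{N_s} \sum_{l=1}^{N_s} \mathbf{V}_{nj} \mathbf{V}_{nl} \sum_{e=1}^{N} \frac{f_l(y_e) f_j(y_e)}{\left( \sum_{k=1}^{N_s} N_k f_k(y_e) \right)^2} \\
&= \sum_{j=1}^{N_s} \sum_{l=1}^{N_s} \mathbf{V}_{nj} \mathbf{V}_{nl} \mathbf{V}^{-1}_{jl} \\
&= \mathbf{V}_{nn} ,
\end{aligned} \tag{24} $$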
and more generally, the whole covariance matrix is reproduced:

$$ \sum_{e=1}^{N} ({}_s\mathcal{P}_i)({}_s\mathcal{P}_j) = \mathbf{V}_{ij} . \tag{25} $$

Therefore, for the expected number of events per $x$-bin indicated by the sPlots, the
statistical uncertainties are straightforward to compute using Eq. (22). The latter expression
is asymptotically correct, and it provides a consistent representation of how the overall
uncertainty on $N_n$ is distributed in $x$ among the events of the $n$-th species. Because of
Eq. (25), and since the determination of the yields is optimal when obtained using a
Likelihood fit, one can conclude that the sPlot technique is itself an optimal method to
reconstruct distributions of control variables.³



4.2.3 Merging sPlots

As a result of the above, two species $i$ and $j$ can be merged into a single species $(i + j)$
without having to repeat the fit and recompute the sWeights. The sPlot of the merged
species is just the sum of the two sPlots obtained by adding the sWeights on an
event-by-event basis:

$$ N_{(i+j)}\, {}_s\tilde{M}_{(i+j)}\, \delta x = \sum_{e \subset \delta x} ({}_s\mathcal{P}_i + {}_s\mathcal{P}_j) . \tag{28} $$

The resulting sPlot has the proper normalization and the proper error bars (Eqs. (20)
and (25)):

$$ \sum_{e=1}^{N} ({}_s\mathcal{P}_i + {}_s\mathcal{P}_j) = N_i + N_j , \tag{29} $$

$$ \sum_{e=1}^{N} ({}_s\mathcal{P}_i + {}_s\mathcal{P}_j)^2 = \mathbf{V}_{ii} + \mathbf{V}_{jj} + 2\mathbf{V}_{ij} \equiv \mathbf{V}_{(i+j)(i+j)} . \tag{30} $$
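Eqs. (24), (25) and (30) are algebraic identities once the covariance matrix is taken as the inverse of the matrix of Eq. (10), so they can be verified on any set of events. The sketch below assumes two species, toy PDFs and arbitrary (not fitted) yields; every name in it is hypothetical.

```python
def covariance(yields, pdfs, ys):
    """Inverse of the 2x2 matrix of Eq. (10), computed by direct sum."""
    a = b = d = 0.0
    for y in ys:
        f1, f2 = pdfs[0](y), pdfs[1](y)
        den = (yields[0] * f1 + yields[1] * f2) ** 2
        a += f1 * f1 / den
        b += f1 * f2 / den
        d += f2 * f2 / den
    det = a * d - b * b
    return [[d / det, -b / det], [-b / det, a / det]]

def sweights(yields, pdfs, cov, y):
    """Eq. (14) for both species at one value of y."""
    f1, f2 = pdfs[0](y), pdfs[1](y)
    den = yields[0] * f1 + yields[1] * f2
    return [(row[0] * f1 + row[1] * f2) / den for row in cov]

# Hypothetical inputs: the identities below do not require fitted yields,
# only that V is the inverse of Eq. (10) built with the same yields.
pdfs = [lambda y: 2.0 * y, lambda y: 2.0 * (1.0 - y)]
ys = [(e + 0.5) / 50.0 for e in range(50)]
yields = [20.0, 30.0]
cov = covariance(yields, pdfs, ys)
w = [sweights(yields, pdfs, cov, y) for y in ys]

sum_w1w1 = sum(a * a for a, b in w)               # Eq. (24): V_11
sum_w1w2 = sum(a * b for a, b in w)               # Eq. (25): V_12
sum_merged_sq = sum((a + b) ** 2 for a, b in w)   # Eq. (30)
```

The normalization of the merged sPlot, Eq. (29), additionally requires the yields to sit at the Likelihood maximum, which is not the case for the arbitrary yields used here.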
³ This is not the case for inPlots for which one gets:

$$ \sum_{e=1}^{N} (\mathcal{P}_i)(\mathcal{P}_j) = N_i N_j \mathbf{V}^{-1}_{ij} . \tag{26} $$

Hence, using the fact that, contrary to sWeights, the inPlot weights of Eq. (2) are positive definite, one
gets:

$$ \sum_{e=1}^{N} (\mathcal{P}_i)^2 < \sum_{e=1}^{N} (\mathcal{P}_i) \Big( \sum_{j=1}^{N_s} \mathcal{P}_j \Big) = N_i \sum_{j=1}^{N_s} N_j \mathbf{V}^{-1}_{ij} = N_i < \mathbf{V}_{ii} . \tag{27} $$

That is to say that the statistical uncertainties attached to the inPlots are always not only smaller than
the ones resulting from the fit, but even smaller than the statistical uncertainties obtained in a
background-free situation.
4.3 sPlot implementation

This Section is meant to show that using sPlot is indeed easy. The different steps to
implement the technique are the following:

1. One is dealing with a data sample in which several species of events are present. 

2. A maximum Likelihood fit is performed to obtain the yields $N_i$ of the various species.
The fit relies on a discriminating variable $y$ uncorrelated with a control variable $x$:
the latter is therefore totally absent from the fit.

3. The sWeights ${}_s\mathcal{P}$ are calculated using Eq. (14) where the covariance matrix is
obtained by inverting the matrix given by Eq. (10).

4. Histograms of $x$ are filled by weighting the events with the sWeights ${}_s\mathcal{P}$. The sum
of the entries is equal to the yield $N_i$ provided by the fit.

5. Error bars per bin are given by Eq. (22). The sum of the error bars squared is
equal to the uncertainty squared $\mathbf{V}_{ii}$ provided by the fit.

6. The sPlots reproduce the true distributions of the species in the control variable $x$,
within the above defined statistical uncertainties.

The sPlot method has been implemented in the ROOT framework under the class TSPlot [2].
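The steps above can be sketched end to end in plain Python. This is a toy illustration under stated assumptions and is not the interface of TSPlot: the sample, the PDFs, the seed, the binning and all function names are hypothetical, and the fit is done with a simple fixed-point iteration on the yields instead of a general minimizer.

```python
import math
import random

def fit_yields(pdfs, ys, n_iter=300):
    """Step 2: maximise Eq. (1) over the yields (fixed PDF shapes) by iterating
    N_j <- sum_e N_j f_j(y_e) / sum_k N_k f_k(y_e)."""
    yields = [len(ys) / 2.0, len(ys) / 2.0]
    for _ in range(n_iter):
        new = [0.0, 0.0]
        for y in ys:
            t0, t1 = yields[0] * pdfs[0](y), yields[1] * pdfs[1](y)
            new[0] += t0 / (t0 + t1)
            new[1] += t1 / (t0 + t1)
        yields = new
    return yields

def sweights_all(yields, pdfs, ys):
    """Step 3: covariance from Eq. (10) by direct sum, then Eq. (14) per event."""
    a = b = d = 0.0
    for y in ys:
        f1, f2 = pdfs[0](y), pdfs[1](y)
        den = (yields[0] * f1 + yields[1] * f2) ** 2
        a, b, d = a + f1 * f1 / den, b + f1 * f2 / den, d + f2 * f2 / den
    det = a * d - b * b
    cov = [[d / det, -b / det], [-b / det, a / det]]
    weights = []
    for y in ys:
        f1, f2 = pdfs[0](y), pdfs[1](y)
        den = yields[0] * f1 + yields[1] * f2
        weights.append([(row[0] * f1 + row[1] * f2) / den for row in cov])
    return cov, weights

def splot_histogram(xs, ws, nbins):
    """Steps 4 and 5: weighted histogram on [0, 1) and errors from Eq. (22)."""
    content, sumw2 = [0.0] * nbins, [0.0] * nbins
    for x, w in zip(xs, ws):
        b = min(int(x * nbins), nbins - 1)
        content[b] += w
        sumw2[b] += w * w
    return content, [math.sqrt(s) for s in sumw2]

# Step 1: hypothetical toy sample; x is generated independently of y.
random.seed(7)
pdfs = [lambda y: 2.0 * y, lambda y: 2.0 * (1.0 - y)]
ys = [math.sqrt(random.random()) for _ in range(300)]
xs = [0.5 * random.random() for _ in range(300)]   # species 1: x in [0, 0.5)
ys += [1.0 - math.sqrt(random.random()) for _ in range(700)]
xs += [random.random() for _ in range(700)]        # species 2: flat x

yields = fit_yields(pdfs, ys)
cov, weights = sweights_all(yields, pdfs, ys)
signal_hist, signal_err = splot_histogram(xs, [w[0] for w in weights], 4)
```

In this toy, species 1 only populates the lower half of the $x$ range, so the two upper bins of `signal_hist` are expected to be compatible with zero within `signal_err`, while their error bars remain finite because background events carry non-zero sWeights there.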

4.4 Illustrations 

To illustrate the technique, one considers in this Section an example derived from the
analysis where sPlots were first used [3] and [4] (but see also [5]). One is dealing
with a data sample in which two species are present: the first is termed signal and the
second background. A maximum Likelihood fit is performed to obtain the two yields $N_1$
and $N_2$. The fit relies on two discriminating variables collectively denoted $y$ which are
chosen within three possible variables denoted (following the notations of [3]) $m_{ES}$, $\Delta E$
and $\mathcal{F}$. The variable which is not incorporated in $y$ is used as a control variable $x$. The
six distributions of the three variables are assumed to be the ones depicted in Fig. 1.

A data sample being built through a Monte Carlo simulation based on the distributions 
shown in Fig. 1, one obtains the three distributions of Fig. 2. Whereas the distribution 
of $\Delta E$ clearly indicates the presence of the signal, the distributions of $m_{ES}$ and $\mathcal{F}$ are less
obviously populated by signal.

Choosing $\Delta E$ and $\mathcal{F}$ as discriminating variables to determine $N_1$ and $N_2$ through a
maximum Likelihood fit, one builds, for the control variable $m_{ES}$ which is unknown to the
fit, the two sPlots for signal and background shown in Fig. 3. For comparison, the PDFs
of $m_{ES}$ taken from Fig. 1 are superimposed on the sPlots. One observes that the sPlot
for signal reproduces correctly the PDF even where the latter vanishes, although the error
bars remain sizeable. This results from the almost complete cancellation between positive
and negative sWeights: the sum of sWeights is close to zero in the tails while the sum
of sWeights squared is not. The occurrence of negative sWeights is provided through the
appearance of the covariance matrix, and its negative components, in the definition of
Eq. (14).

Figure 1: Distributions of the three different discriminating variables available to perform
the Likelihood fit: $m_{ES}$, $\Delta E$, $\mathcal{F}$. Among the three variables, two are used to perform
the fit while one is kept out of the fit to serve the purpose of a control variable. The
three distributions on the top (resp. bottom) of the figure correspond to the signal (resp.
background). The unit of the vertical axis is chosen such that it indicates the number of
entries per bin, if one slices the histograms in 25 bins.

Figure 2: Distributions of the three discriminating variables for signal plus background.
The three distributions are the ones obtained from a data sample obtained through a
Monte Carlo simulation based on the distributions shown in Fig. 1. The data sample
consists of 500 signal events and 5000 background events.

Figure 3: The sPlots (signal on top, background on bottom) obtained for $m_{ES}$ are
represented as dots with error bars. They are obtained from a fit using only information from
$\Delta E$ and $\mathcal{F}$. The black curves are the PDFs of $m_{ES}$ of Fig. 1: these PDFs are unknown
to the fit.

A word of caution is in order with respect to the error bars. Whereas their sum in
quadrature is identical to the statistical uncertainties of the yields determined by the
fit, and if, in addition, they are asymptotically correct (cf. Section 4.2.2), the error bars
should be handled with care for low statistics and/or for too fine binning. This is because
the error bars do not incorporate two known properties of the PDFs: PDFs are positive
definite and can be non-zero in a given $x$-bin, even if in the particular data sample at hand,
no event is observed in this bin. The latter limitation is not specific to sPlots, rather
it is always present when one is willing to infer the PDF at the origin of a histogram,
when, for some bins, the number of entries does not guarantee the applicability of the
Gaussian regime. In such situations, a satisfactory practice is to attach allowed ranges to
the histogram to indicate the upper and lower limits of the PDF value which are consistent
with the actual observation, at a given confidence level. Although this is straightforward
to implement, even when dealing with sWeighted events, for the sake of simplicity, this
subject is not discussed further in the paper.

Choosing $m_{ES}$ and $\Delta E$ as discriminating variables to determine $N_1$ and $N_2$ through
a maximum Likelihood fit, one builds, for the control variable $\mathcal{F}$ which is unknown to
the fit, the two sPlots for signal and background shown in Fig. 4. For comparison, the
PDFs of $\mathcal{F}$ taken from Fig. 1 are superimposed on the sPlots. In the sPlot for signal
one observes that error bars are the largest in the $x$ regions where the background is the
largest.

Figure 4: The sPlots (signal on top, background on bottom) obtained for $\mathcal{F}$ are
represented as dots with error bars. They are obtained from a fit using only information from
$m_{ES}$ and $\Delta E$. The black curves are the PDFs of $\mathcal{F}$ of Fig. 1: these PDFs are unknown
to the fit.
4.5 Application: efficiency corrected yields



Beside providing a convenient and optimal tool to cross-check the analysis by allowing
distributions of control variables to be reconstructed and then compared with
expectations, the sPlot formalism can be applied also to extract physics results, which would
otherwise be difficult to obtain. For example, one may be willing to explore some
unknown physics involved in the distribution of a variable $x$. Or, one may be interested in
correcting a particular yield provided by the Likelihood fit for a selection efficiency which
is known to depend on a variable $x$, for which the PDF is unknown.

To be specific, one can take the example of a three body decay analysis of a species,
the signal, polluted by background, while the signal PDF inside the two-dimensional
Dalitz plot is not known, because of unknown contributions of resonances, continuum
and an interference pattern. Since the $x$-dependence of the selection efficiency $\epsilon(x)$ can
be computed without a priori knowledge of the $x$-distributions, one can build the efficiency
corrected two-dimensional sPlots (cf. Eq. (15)):

$$ \frac{1}{\epsilon(\bar{x})}\, N_n\, {}_s\tilde{M}_n(\bar{x})\, \delta x = \sum_{e \subset \delta x} \frac{1}{\epsilon(x_e)}\, {}_s\mathcal{P}_n(y_e) , \tag{31} $$

and compute the efficiency corrected yields:

$$ N_n^{\epsilon} = \sum_{e=1}^{N} \frac{{}_s\mathcal{P}_n(y_e)}{\epsilon(x_e)} . \tag{32} $$

Analyses can then use the sPlot formalism for validation purposes, but also, using Eq. (31)
and Eq. (32), to probe for resonance structures and to measure branching ratios.
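Eq. (32) amounts to a weighted sum over events, as the following sketch shows. The sWeights, the values of $x$ and the two-step efficiency map are made-up numbers used only to exercise the formula; in practice the sWeights would come from Eq. (14).

```python
def efficiency_corrected_yield(sweights, xs, efficiency):
    """Eq. (32): sum over events of sP_n(y_e) / eps(x_e)."""
    return sum(w / efficiency(x) for w, x in zip(sweights, xs))

# Hypothetical inputs: sWeights of one species and a made-up efficiency map.
sw = [1.2, -0.3, 0.9, 0.2]
xs = [0.1, 0.4, 0.6, 0.9]

def eff(x):
    return 0.5 if x < 0.5 else 0.8

corrected = efficiency_corrected_yield(sw, xs, eff)
```

Here the uncorrected sum of sWeights is 2.0 while the corrected yield is 3.175, each event being scaled by the inverse efficiency at its own value of $x$.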



5 Conclusion 

The technique presented in this paper applies when one examines a data sample
originating from different sources of events: using a set $y$ of discriminating variables, a Likelihood
fit is performed on the data sample to determine the yields of the sources. By
building sPlots, one can reconstruct the distributions of variables, separately for each source
present in the data sample, provided the variables are uncorrelated with the set $y$ used
in the fit. Although no cut is applied (hence, the sPlot of a given species represents the
whole statistics of this species) the distributions obtained are pure (free from the potential
background arising from the other species) in a statistical sense. The more discriminating
the discriminating variables $y$, the clearer the sPlot is. The technique is
straightforward to implement and features several nice properties: both the normalizations and the
statistical uncertainties of the sPlots reflect the fit outputs.
A Pedagogical examples

The purpose of this Appendix is to detail in simplified situations how and why sPlot
works. One begins with the simplest situation and proceeds to more complex ones.



A.1 Simple cut-and-count analysis

In this Section, a very simple situation is considered where the proper way to reconstruct 
signal and background distributions for a control variable x is obvious from the start. The 
purpose is to observe the sPlot technique at work, when one knows beforehand what the
outcome should be. 

One considers a data sample consisting of $N_s = 2$ species: species 1 is referred to as
the signal and species 2 as the background. A unique discriminating variable $y \in [0, 1]$ is
used in the fit. One further assumes that:

• the signal distribution is the step-function:

$$ f_1(y < y_0) = 0 , \tag{33} $$

$$ f_1(y \geq y_0) = (1 - y_0)^{-1} , \tag{34} $$

• the background distribution is uniform in the full range:

$$ f_2(y) = 1 . \tag{35} $$

Therefore, one is dealing with a cut-and-count analysis: there is a pure background
sideband for $y < y_0$, and the shapes of the signal and background distributions offer no
discriminating power in the region where the signal is present, for $y \geq y_0$. Denoting $N$
the total number of events present in the data sample, $N_<$ the number of events located
below $y_0$, and $N_>$ the number of events located above $y_0$:

1. the expected number of background and signal events can be deduced without any 
fit, from the sideband: 



N2 = -Ar< 



1 

yo 



Ni 



1 - yo 



(36) 

(37) 



2. A^< and A^> being two independent numbers of events, the covariance matrix can be 
deduced directly from Eqs. (36)-(37): 



V 




y'o ^ 



yo 



(38) 
3. denoting $\delta N^x_<$ the number of events in a given $x$-bin, with $y < y_0$, the background
distribution $M_2(x)$ can also be deduced by a mere rescaling of $\delta N^x_<$, as in Eq. (36):

$$ N_2 M_2(x) \delta x = \frac{1}{y_0} \delta N^x_< . \tag{39} $$

Similarly to Eq. (37), the signal distribution is given by:

$$ N_1 M_1(x) \delta x = -(1 - y_0) N_2 M_2(x) \delta x + \delta N^x_> , \tag{40} $$

where $\delta N^x_>$ is the number of events in the $x$-bin with $y \geq y_0$. That is to say, one can
obtain the signal distribution from the (mixed) events populating the domain $y \geq y_0$,
if one subtracts the contribution of background events,
which is known from Eq. (39). Stated differently, one is led to assign the negative
weight $-(1 - y_0)/y_0$ to those events in the $x$-bin which satisfy $y < y_0$.
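The sideband arithmetic of Eqs. (36)-(38) can be checked numerically. What follows is a minimal sketch, assuming illustrative values ($y_0 = 0.4$, $N_< = 300$, $N_> = 900$) which do not come from the text; the helper name `cut_and_count` is hypothetical. The covariance matrix is obtained by propagating the variances of the two independent counts through the linear relations of Eqs. (36)-(37).

```python
import numpy as np

def cut_and_count(n_below, n_above, y0):
    """Sideband estimates for the two-species toy model (hypothetical helper)."""
    # Jacobian of (N_1, N_2) with respect to (N_<, N_>), read off Eqs. (36)-(37)
    jac = np.array([[-(1.0 - y0) / y0, 1.0],
                    [1.0 / y0,         0.0]])
    counts = np.array([n_below, n_above], dtype=float)
    yields = jac @ counts                  # (N_1, N_2)
    # Independent counts: Var[N_<] = N_<, Var[N_>] = N_>
    cov = jac @ np.diag(counts) @ jac.T    # Eq. (38)
    return yields, cov

# Illustrative inputs (assumed, not taken from the text)
yields, cov = cut_and_count(n_below=300, n_above=900, y0=0.4)
print("N_1, N_2 =", yields)
print("V =")
print(cov)
```

With these inputs the off-diagonal element is negative, as expected: an upward fluctuation of the sideband count raises the background yield and lowers the signal yield.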

Whereas in such a simple situation the use of the sPlot formalism would be awkward, the
latter should reproduce the above obvious results, and indeed it does. The proof goes as
follows:

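1. denoting $f_i(0)$ (resp. $f_i(1)$) the value taken by the PDF of species $i$ for $y < y_0$ (resp.
$y \geq y_0$), Eq. (17) reads:

$$ 1 = \sum_{e=1}^{N} \frac{f_1(y_e)}{\sum_{k=1}^{2} N_k f_k(y_e)} = N_< \frac{f_1(0)}{N_1 f_1(0) + N_2 f_2(0)} + N_> \frac{f_1(1)}{N_1 f_1(1) + N_2 f_2(1)} = \frac{N_> (1 - y_0)^{-1}}{N_1 (1 - y_0)^{-1} + N_2} , \tag{41} $$

$$ 1 = \sum_{e=1}^{N} \frac{f_2(y_e)}{\sum_{k=1}^{2} N_k f_k(y_e)} = N_< \frac{f_2(0)}{N_1 f_1(0) + N_2 f_2(0)} + N_> \frac{f_2(1)}{N_1 f_1(1) + N_2 f_2(1)} = \frac{N_<}{N_2} + \frac{N_>}{N_1 (1 - y_0)^{-1} + N_2} . \tag{42} $$

The first equation yields:

$$ N_1 (1 - y_0)^{-1} + N_2 = N_> (1 - y_0)^{-1} , \tag{43} $$

and thus, for the second equation:

$$ 1 = \frac{N_<}{N_2} + (1 - y_0) , \tag{44} $$

which leads to Eqs. (36)-(37).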
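2. similarly, Eq. (10) yields

$$ \mathbf{V}^{-1} = \frac{1}{N_>} \begin{pmatrix} 1 & 1 - y_0 \\ 1 - y_0 & (1 - y_0)^2 + y_0^2 \frac{N_>}{N_<} \end{pmatrix} . \tag{45} $$

For example, using Eq. (43), the $\mathbf{V}^{-1}_{11}$ component is computed as follows:

$$ \mathbf{V}^{-1}_{11} = \sum_{e=1}^{N} \frac{f_1(y_e)^2}{\left(\sum_{k=1}^{2} N_k f_k(y_e)\right)^2} = \frac{N_> (1 - y_0)^{-2}}{\left(N_1 (1 - y_0)^{-1} + N_2\right)^2} = \frac{1}{N_>} . \tag{46} $$

And similarly for the other components. Inverting, one gets Eq. (38).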
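3. Eq. (15) then reproduces Eqs. (39)-(40). Namely:

$$ \begin{aligned}
N_1\, {}_sM_1(x) \delta x &= \sum_{e \subset \delta x} \frac{\mathbf{V}_{11} f_1(y_e) + \mathbf{V}_{12} f_2(y_e)}{\sum_{k=1}^{2} N_k f_k(y_e)} \\
&= \delta N^x_< \frac{\mathbf{V}_{11} f_1(0) + \mathbf{V}_{12} f_2(0)}{N_1 f_1(0) + N_2 f_2(0)} + \delta N^x_> \frac{\mathbf{V}_{11} f_1(1) + \mathbf{V}_{12} f_2(1)}{N_1 f_1(1) + N_2 f_2(1)} \\
&= \delta N^x_< \frac{\mathbf{V}_{12}}{N_2} + \delta N^x_> \frac{\mathbf{V}_{11} (1 - y_0)^{-1} + \mathbf{V}_{12}}{N_1 (1 - y_0)^{-1} + N_2} \\
&= \delta N^x_< \frac{-(1 - y_0)\, y_0^{-2} N_<}{N_<\, y_0^{-1}} + \delta N^x_> \frac{N_> (1 - y_0)^{-1}}{N_> (1 - y_0)^{-1}} \\
&= -\frac{1 - y_0}{y_0} \delta N^x_< + \delta N^x_> ,
\end{aligned} \tag{47} $$

and:

$$ \begin{aligned}
N_2\, {}_sM_2(x) \delta x &= \sum_{e \subset \delta x} \frac{\mathbf{V}_{21} f_1(y_e) + \mathbf{V}_{22} f_2(y_e)}{\sum_{k=1}^{2} N_k f_k(y_e)} \\
&= \delta N^x_< \frac{\mathbf{V}_{21} f_1(0) + \mathbf{V}_{22} f_2(0)}{N_1 f_1(0) + N_2 f_2(0)} + \delta N^x_> \frac{\mathbf{V}_{21} f_1(1) + \mathbf{V}_{22} f_2(1)}{N_1 f_1(1) + N_2 f_2(1)} \\
&= \delta N^x_< \frac{\mathbf{V}_{22}}{N_2} + \delta N^x_> \frac{\mathbf{V}_{21} (1 - y_0)^{-1} + \mathbf{V}_{22}}{N_1 (1 - y_0)^{-1} + N_2} \\
&= \delta N^x_< \frac{N_<\, y_0^{-2}}{N_<\, y_0^{-1}} = \frac{1}{y_0} \delta N^x_< .
\end{aligned} \tag{48} $$

4. it can be shown as well that Eqs. (18)-(19)-(20)-(21)-(25) hold.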



Therefore, in this very simple situation where the problem of reconstructing the distributions
of signal and background events is glaringly obvious, the sPlot formalism reproduces
the expected results.
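The chain of Eqs. (41)-(48) can be reproduced numerically by treating the two $y$ regions as two groups of events with constant PDF values. The sketch below assumes the same illustrative inputs as before ($y_0 = 0.4$, $N_< = 300$, $N_> = 900$, not taken from the text); all variable names are hypothetical. It builds the inverse covariance matrix of Eq. (10), inverts it, and evaluates the sWeights of Eq. (14) in each region.

```python
import numpy as np

# Illustrative inputs (assumed, not taken from the text)
y0, n_below, n_above = 0.4, 300.0, 900.0

# PDF values f_i per region; rows: signal, background; columns: y < y0, y >= y0
f = np.array([[0.0, 1.0 / (1.0 - y0)],
              [1.0, 1.0]])
counts = np.array([n_below, n_above])

# Yields from the sideband formulas, Eqs. (36)-(37)
yields = np.array([n_above - (1.0 - y0) / y0 * n_below, n_below / y0])

# Denominator sum_k N_k f_k, constant within each region
denom = yields @ f

# Eq. (17): sum over events of f_i / denom, expected to equal 1 for both species
likelihood_condition = f @ (counts / denom)

# Eq. (10): inverse covariance matrix, then the covariance matrix itself
v_inv = np.einsum("a,ia,ja->ij", counts / denom**2, f, f)
v = np.linalg.inv(v_inv)

# Eq. (14): sWeights, indexed by (species, region)
sweights = (v @ f) / denom
print(sweights)
```

The signal sWeights come out as $-(1 - y_0)/y_0$ below $y_0$ and $1$ above, and the background sWeights as $1/y_0$ below and $0$ above, which are the weights read off Eqs. (39)-(40).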
A.2 Extended cut-and-count analysis

The example of the previous Section A.1 is a very particular case of a more general
situation where the $y$-range is split into $n_y$ slices inside which one disregards the shape of
the distributions of the species, whether these distributions are the same or not. Using
Greek letters to index the $y$-slices, this amounts to replacing the $f_i(y)$ PDFs by step
functions with constant values $F_i^\alpha$. For each $y$-bin $\alpha$, these constant values are defined by
the integral over the $y$-bin:

$$ f_i(y) \to F_i^\alpha = \int_\alpha f_i(y)\, dy , \tag{49} $$

$$ \sum_{\alpha=1}^{n_y} F_i^\alpha = 1 . \tag{50} $$

With this notation, the number of events $\bar{N}_\alpha$ expected in the slice $\alpha$ is given by:

$$ \bar{N}_\alpha = \sum_{i=1}^{N_s} N_i F_i^\alpha . \tag{51} $$

To make particularly obvious what must be the outcome of the sPlot technique, in the
previous Section it was assumed that $n_y = N_s = 2$, and that the signal was utterly absent
in one of the two $y$-slices: $F_1^1 = 0$, $F_1^2 = 1$, $F_2^1 = y_0$ and $F_2^2 = 1 - y_0$.

Below one proceeds in two steps, first considering the more general case where only
$n_y = N_s$ is assumed (Section A.2.1), then considering the extended cut-and-count analysis
where $n_y > N_s$ (Section A.2.2). Since the general case discussed in the presentation of the
sPlot formalism corresponds to the limit $n_y \to \infty$, what follows amounts to a step-by-step
new derivation of the technique.



A.2.1 Generalized cut-and-count analysis: $n_y = N_s$

When the number of $y$-slices equals the number of species, the solution remains obvious,
if the $N_s \times N_s$ matrix $F_i^\alpha$ is invertible (if not, the $N_i$ cannot be determined). In that case,
one can identify the expected numbers of events $\bar{N}_\alpha$ with the observed numbers of events
$N_\alpha$, and thus:

1. one recovers the expected number of events $N_i$ from the numbers of events $N_\alpha$ observed
in the $n_y$ slices, by inverting Eq. (51):

$$ N_i = \sum_{\alpha=1}^{N_s} N_\alpha (F^{-1})_i^\alpha , \tag{52} $$

2. the numbers $N_\alpha$ being statistically independent, one obtains directly the covariance
matrix:

$$ \mathbf{V}_{ij} = \sum_{\alpha=1}^{N_s} N_\alpha (F^{-1})_i^\alpha (F^{-1})_j^\alpha , \tag{53} $$
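3. similarly to Eq. (51), the number of events $\delta \bar{N}^x_\alpha$ observed in the $y$-slice $\alpha$ and in the
bin $x$ of width $\delta x$ is given by:

$$ \delta \bar{N}^x_\alpha = \sum_{i=1}^{N_s} N_i M_i(x) \delta x\, F_i^\alpha , \tag{54} $$

and thus, the $x$-distribution of species $i$ is:

$$ \delta \bar{N}^x_i \equiv N_i M_i(x) \delta x = \sum_{\alpha=1}^{N_s} \delta \bar{N}^x_\alpha (F^{-1})_i^\alpha . \tag{55} $$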

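It remains to be shown that Eq. (55) is reproduced using the sPlot formalism. First,
using Eq. (49) and Eq. (51), one observes that:

$$ \sum_{e=1}^{N} \frac{f_i(y_e)}{\sum_{k=1}^{N_s} N_k f_k(y_e)} = \sum_{\alpha=1}^{N_s} N_\alpha \frac{F_i^\alpha}{\sum_{k=1}^{N_s} N_k F_k^\alpha} = \sum_{\alpha=1}^{N_s} N_\alpha \frac{F_i^\alpha}{N_\alpha} = \sum_{\alpha=1}^{N_s} F_i^\alpha = 1 , \tag{56} $$

which shows that the obvious solution Eq. (52) is the one which maximizes the extended
log-Likelihood. Similarly:

$$ \mathbf{V}^{-1}_{ij} = \sum_{\alpha=1}^{N_s} N_\alpha \frac{F_i^\alpha F_j^\alpha}{\left(\sum_{k=1}^{N_s} N_k F_k^\alpha\right)^2} = \sum_{\alpha=1}^{N_s} \frac{1}{N_\alpha} F_i^\alpha F_j^\alpha , \tag{57} $$

whose inverse is given by Eq. (53), and thus:

$$ N_i\, {}_sM_i(x) \delta x = \sum_{\alpha=1}^{N_s} \delta N^x_\alpha \frac{\sum_{j=1}^{N_s} \mathbf{V}_{ij} F_j^\alpha}{N_\alpha} = \sum_{\alpha=1}^{N_s} \delta N^x_\alpha (F^{-1})_i^\alpha . \tag{58} $$

The sPlot formalism reproduces Eq. (55).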
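The equivalence between the matrix inversion of Eqs. (52)-(53) and the sPlot route of Eqs. (56)-(58) can be illustrated with a small numerical sketch. The $3 \times 3$ matrix `F` and the slice counts below are assumed for illustration only, and the array layout `F[i, alpha]` is a convention chosen here, not one fixed by the text.

```python
import numpy as np

# Illustrative step-function PDFs F[i, alpha] (assumed); each row sums to 1, Eq. (50)
F = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
counts = np.array([430.0, 340.0, 230.0])   # observed N_alpha (assumed)

# F_inv[alpha, i] plays the role of (F^-1)_i^alpha
F_inv = np.linalg.inv(F)
yields = counts @ F_inv                                   # Eq. (52)
cov = np.einsum("a,ai,aj->ij", counts, F_inv, F_inv)      # Eq. (53)

# sPlot route: inverse covariance of Eq. (57), then sWeights as in Eq. (58)
v_inv = np.einsum("a,ia,ja->ij", 1.0 / counts, F, F)
sweights = (np.linalg.inv(v_inv) @ F) / counts            # indexed by (species, slice)

print("yields:", yields)
print("max |sWeight - F^-1|:", np.abs(sweights - F_inv.T).max())
```

In this square case the sWeight of species $i$ in slice $\alpha$ is simply the matrix element $(F^{-1})_i^\alpha$, so the two routes agree up to rounding.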

A.2.2 Extended cut-and-count analysis: $n_y > N_s$

In the more general situation where the number of $y$-slices is larger than the number of
species, there is no obvious solution, either for determining the $N_i$ or for reconstructing
the $x$-distribution of each species (in particular, Eq. (52) is lost). Because of this lack
of an obvious solution, what follows is a rephrasing of the derivation of the sPlots, but
taking a different point of view, and in the case where the $y$-distributions are binned.

The best determination of the $N_i$ (here as well as in the previous simpler situations)
is provided by the Likelihood method which yields (cf. Eq. (17)):

$$ \sum_{\alpha=1}^{n_y} \frac{N_\alpha F_i^\alpha}{\sum_{k=1}^{N_s} N_k F_k^\alpha} = 1 , \quad \forall i , \tag{59} $$

with a variance matrix (cf. Eq. (10)):

$$ \mathbf{V}^{-1}_{ij} = \sum_{\alpha=1}^{n_y} \frac{N_\alpha F_i^\alpha F_j^\alpha}{\left(\sum_{k=1}^{N_s} N_k F_k^\alpha\right)^2} , \tag{60} $$
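from which one computes the covariance matrix $\mathbf{V}_{ij}$. Instead of Eq. (52) the number of
events $N_i$ provided by Eq. (59) is shown below to satisfy the equality (cf. Eq. (20)):

$$ N_i = \sum_{\alpha=1}^{n_y} N_\alpha ({}_s\mathcal{P})_i^\alpha , \tag{61} $$

where the matrix element $({}_s\mathcal{P})_i^\alpha$ is the sWeight (Eq. (14)) for species $i$ of events with $y_e$
lying in the $y$-slice $\alpha$, namely:

$$ ({}_s\mathcal{P})_i^\alpha = \frac{\sum_{j=1}^{N_s} \mathbf{V}_{ij} F_j^\alpha}{\sum_{k=1}^{N_s} N_k F_k^\alpha} . \tag{62} $$

The identity of Eq. (61) is not asymptotic: it holds even for finite statistics, since the
contractions with $\mathbf{V}^{-1}_{li}$ of both the left- and right-hand sides yield the same result. Indeed
(Eq. (18)):

$$ \sum_{i} \mathbf{V}^{-1}_{li} N_i = \sum_{i} \sum_{\alpha=1}^{n_y} \frac{N_\alpha F_l^\alpha F_i^\alpha}{\left(\sum_{k=1}^{N_s} N_k F_k^\alpha\right)^2} N_i = \sum_{\alpha=1}^{n_y} \frac{N_\alpha F_l^\alpha}{\sum_{k=1}^{N_s} N_k F_k^\alpha} = 1 , \tag{63} $$

which is identical to:

$$ \sum_{i} \mathbf{V}^{-1}_{li} \sum_{\alpha=1}^{n_y} N_\alpha ({}_s\mathcal{P})_i^\alpha = \sum_{\alpha=1}^{n_y} N_\alpha \frac{\sum_{i,j} \mathbf{V}^{-1}_{li} \mathbf{V}_{ij} F_j^\alpha}{\sum_{k=1}^{N_s} N_k F_k^\alpha} = \sum_{\alpha=1}^{n_y} \frac{N_\alpha F_l^\alpha}{\sum_{k=1}^{N_s} N_k F_k^\alpha} = 1 . \tag{64} $$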

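Since Eq. (61) holds for the complete sample of events, it must hold as well for any
sub-sample, provided the splitting into sub-samples is not correlated with the variable $y$.
Namely, for every $x$-bin, one is guaranteed to observe, on average, the same relationship
between the numbers of events $\delta N_i^x$ and $\delta N_\alpha^x$. The sPlot obtained from the weighted sum

$$ \delta N_i^x = \sum_{\alpha=1}^{n_y} \delta N_\alpha^x ({}_s\mathcal{P})_i^\alpha \tag{65} $$

is an unbiased estimator of the true distribution of $x$ for species $i$. One can provide a
direct proof that the above sPlot of Eq. (65) reproduces the true distribution by following
the same line which leads to Eq. (12). On average, using successively:

$$ \langle \delta N_\alpha^x \rangle = \sum_{l=1}^{N_s} N_l M_l(x) F_l^\alpha \delta x \tag{66} $$

and hence:

$$ \langle N_\alpha \rangle = \sum_{x} \langle \delta N_\alpha^x \rangle = \sum_{x} \sum_{k=1}^{N_s} N_k M_k(x) \delta x\, F_k^\alpha = \sum_{k=1}^{N_s} N_k F_k^\alpha , \tag{67} $$

one gets:

$$ \begin{aligned}
\left\langle \sum_{\alpha=1}^{n_y} \delta N_\alpha^x ({}_s\mathcal{P})_i^\alpha \right\rangle
&= \sum_{\alpha=1}^{n_y} \left( \sum_{l=1}^{N_s} N_l M_l(x) F_l^\alpha \delta x \right) \frac{\sum_{j=1}^{N_s} \mathbf{V}_{ij} F_j^\alpha}{\sum_{k=1}^{N_s} N_k F_k^\alpha} \\
&= \sum_{l=1}^{N_s} N_l M_l(x) \delta x \left( \sum_{j=1}^{N_s} \mathbf{V}_{ij} \sum_{\alpha=1}^{n_y} \frac{F_l^\alpha F_j^\alpha}{\sum_{k=1}^{N_s} N_k F_k^\alpha} \right) \\
&= \sum_{l=1}^{N_s} N_l M_l(x) \delta x \left( \sum_{j=1}^{N_s} \mathbf{V}_{ij} \sum_{\alpha=1}^{n_y} \frac{N_\alpha F_l^\alpha F_j^\alpha}{\left(\sum_{k=1}^{N_s} N_k F_k^\alpha\right)^2} \right) \\
&= \sum_{l=1}^{N_s} N_l M_l(x) \delta x \left( \sum_{j=1}^{N_s} \mathbf{V}_{ij} \mathbf{V}^{-1}_{jl} \right) \\
&= N_i M_i(x) \delta x \equiv \langle \delta N_i^x \rangle ,
\end{aligned} \tag{68} $$

where the third line uses Eq. (67) to replace $N_\alpha$ by its average. This concludes the
discussion of the situation where the $y$-distributions are step functions.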
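The finite-statistics identity of Eq. (61) and the weighted sum of Eq. (65) can be checked on a toy binned sample. The sketch below is a hedged illustration: the step-function PDFs, the two-dimensional count table and the helper `fit_yields` are all assumed, and the multiplicative iteration used to maximize the likelihood is a simple stand-in for a generic minimizer (its fixed point satisfies Eq. (59)), not the procedure prescribed by the text.

```python
import numpy as np

def fit_yields(counts, F, n_iter=2000):
    """Maximize the binned extended likelihood (hypothetical helper).

    Multiplicative EM-type update; at convergence Eq. (59) holds.
    """
    yields = np.full(F.shape[0], counts.sum() / F.shape[0])
    for _ in range(n_iter):
        denom = yields @ F                        # sum_k N_k F_k^alpha
        yields = yields * (F @ (counts / denom))
    return yields

# Illustrative inputs (assumed): N_s = 2 species, n_y = 4 slices, 2 x-bins
F = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.4, 0.3, 0.2, 0.1]])
counts_xy = np.array([[130.0, 140.0, 100.0,  90.0],     # x-bin 1, per y-slice
                      [100.0, 120.0, 140.0, 180.0]])    # x-bin 2, per y-slice
counts = counts_xy.sum(axis=0)                           # N_alpha

yields = fit_yields(counts, F)
denom = yields @ F
v_inv = np.einsum("a,ia,ja->ij", counts / denom**2, F, F)   # Eq. (60)
v = np.linalg.inv(v_inv)
sweights = (v @ F) / denom                                   # Eq. (62), (species, slice)
splot = counts_xy @ sweights.T                               # Eq. (65), (x-bin, species)

print("fitted yields:", yields)
print("sPlot per x-bin and species:")
print(splot)
```

Summing the sPlot over the $x$-bins returns the fitted yields, and summing it over the species returns the number of events observed in each $x$-bin, even though the counts used here do not follow the model exactly.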



B Extended sPlots: a species is known (fixed)

It may happen that the yields of some species are not derived from the data sample at
hand, but are taken to be known from other sources of information. Here, one denotes
collectively as species '0' the overall component of such species. The number of expected
events for species '0', $N_0$, being assumed to be known, is held fixed in the fit. In this
Section, the indices $i, j, \ldots$ run over the $N_s$ species for which the yields are fitted, the
fixed species '0' being excepted ($i, j, \ldots \neq 0$).

One can meet various instances of such a situation. Two extreme cases are: 

1. the species '0' is very well known, such that the information on it contained in the
data sample at hand is irrelevant. Not only is $N_0$ already pinned down by other
means, but $M_0(x)$, the marginal distribution of the fixed species, is available,

2. the species '0' is poorly known, and the data sample at hand is unable to resolve its
contribution. This is the case if the $y$ variables cannot discriminate species
'0' from at least one of the other $N_s$ species. Stated differently, if $N_0$ is left free to vary
in the fit, the covariance matrix blows up for certain species and the measurement
is lost. To avoid that, one is led to accept an a priori value for $N_0$, and to compute
systematics associated with the choice made for it. In that case, the worst case scenario
is met if $M_0(x)$ is unknown as well.

It is shown below that the sPlot formalism can be extended to deal with this situation,
whether or not $M_0(x)$ is known, although in the latter case the statistical price to pay
can be prohibitive.

B.1 Assuming $M_0$ to be known

Here, it is assumed that $M_0(x)$ is taken for granted. Then, it is not difficult to show
that the Extended sPlot, which reproduces the marginal distribution of species $n$, is now
given by:

$$ N_n\, {}_sM_n(x) \delta x = c_n M_0(x) \delta x + \sum_{e \subset \delta x} {}_s\mathcal{P}_n , \tag{69} $$

where:

• ${}_s\mathcal{P}_n$ is the previously defined sWeight of Eq. (14):

$$ {}_s\mathcal{P}_n = \frac{\sum_{j} \mathbf{V}_{nj} f_j}{\sum_{k} N_k f_k + N_0 f_0} , \tag{70} $$

where the covariance matrix $\mathbf{V}_{ij}$ is the one resulting from the fit of the $N_{i \neq 0}$ expected
numbers of events, that is to say the inverse of the matrix:

$$ \mathbf{V}^{-1}_{ij} = \sum_{e=1}^{N} \frac{f_i f_j}{\left(\sum_{k} N_k f_k + N_0 f_0\right)^2} , \tag{71} $$

• $c_n$ is the species dependent coefficient:

$$ c_n = N_n - \sum_{j} \mathbf{V}_{nj} . \tag{72} $$

Some remarks deserve to be made: 

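• The Likelihood is now written:

$$ \mathcal{L} = \sum_{e=1}^{N} \ln \left\{ \sum_{i=1}^{N_s} N_i f_i(y_e) + N_0 f_0(y_e) \right\} - \left\{ \sum_{i=1}^{N_s} N_i + N_0 \right\} . \tag{73} $$

Because $N_0$ is held fixed, in general, its assumed value combined with the fitted
values for the $N_i$ does not maximize it:

$$ \frac{\partial \mathcal{L}}{\partial N_0} = \sum_{e=1}^{N} \frac{f_0}{\sum_{k} N_k f_k + N_0 f_0} - 1 \neq 0 . \tag{74} $$

• It follows that the sum over the number of events per species does not equal the
total number of events in the sample:

$$ \sum_{i} N_i = N - N_0 \sum_{e=1}^{N} \frac{f_0}{\sum_{k} N_k f_k + N_0 f_0} \neq N - N_0 . \tag{75} $$

Similarly, the Variance Matrix Sum Rule Eq. (18) holds only for $N_0 = 0$:

$$ \sum_{i} N_i \mathbf{V}^{-1}_{ij} = 1 - N_0 v_j , \tag{76} $$

where the vector $v_j$ is defined by:

$$ v_j \equiv \sum_{e=1}^{N} \frac{f_j f_0}{\left(\sum_{k} N_k f_k + N_0 f_0\right)^2} . \tag{77} $$

Accordingly, Eq. (19) becomes:

$$ \sum_{j} \mathbf{V}_{ij} = N_i + N_0 \sum_{j} \mathbf{V}_{ij} v_j . \tag{78} $$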
• Thus, as they should, the $c_n$ coefficients vanish only for $N_0 = 0$:

$$ c_n = -N_0 \sum_{j} \mathbf{V}_{nj} v_j . \tag{79} $$

• The above defined Extended sPlots share the same properties as the sPlots:

1. They reproduce the true marginal distributions, as in Eq. (16).

2. In particular, they are properly normalized, as in Eq. (20).

3. The sum of the squared sWeights reproduces $\sigma^2[N_n]$, as in Eq. (24).
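The relations of this Section can be checked on a binned toy sample in which the sum over events becomes a sum over $y$-slices weighted by the slice counts. The following sketch is illustrative only: the PDFs `F` and `F0`, the counts, the fixed yield `n0` and the helper `fit_yields` are assumed, and the multiplicative iteration is again a stand-in for a generic likelihood maximization with $N_0$ held fixed.

```python
import numpy as np

def fit_yields(counts, F, n0, F0, n_iter=2000):
    """Fit the free yields with N_0 held fixed (hypothetical helper, EM-type update)."""
    yields = np.full(F.shape[0], (counts.sum() - n0) / F.shape[0])
    for _ in range(n_iter):
        denom = yields @ F + n0 * F0
        yields = yields * (F @ (counts / denom))
    return yields

# Illustrative binned inputs (assumed): two fitted species plus the fixed species '0'
F = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.4, 0.3, 0.2, 0.1]])
F0 = np.array([0.4, 0.1, 0.1, 0.4])
counts = np.array([280.0, 265.0, 250.0, 305.0])
n0 = 100.0

yields = fit_yields(counts, F, n0, F0)
denom = yields @ F + n0 * F0
w = counts / denom**2
v_inv = np.einsum("a,ia,ja->ij", w, F, F)     # Eq. (71)
v = np.linalg.inv(v_inv)
v_vec = F @ (w * F0)                           # Eq. (77)
sweights = (v @ F) / denom                     # Eq. (70), (species, slice)
c = yields - v.sum(axis=1)                     # Eq. (72)

print("fitted yields:", yields)
print("c_n:", c, " vs -N_0 * V v:", -n0 * (v @ v_vec))
```

The two expressions printed for $c_n$ agree, as in Eq. (79), and adding $c_n$ to the summed sWeights restores the fitted yield $N_n$.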

B.2 Assuming $M_0$ to be unknown

In the above treatment, because one assumes that a special species '0' enters in the sample
composition, the sWeights per event do not add up to unity, unlike in Eq. (21). Instead one
may define the sWeights for species '0' as:

$$ {}_s\mathcal{P}_0 \equiv 1 - \sum_{i} {}_s\mathcal{P}_i \tag{80} $$

and introduce the reconstructed ${}_sM_0$ distribution (normalized to unity):

$$ {}_sM_0(x) \delta x \equiv \left( N - \sum_{i,j} \mathbf{V}_{ij} \right)^{-1} \sum_{e \subset \delta x} {}_s\mathcal{P}_0 , \tag{81} $$

which reproduces the true distribution $M_0(x)$ if (by chance) the value assumed for $N_0$ is
the one which maximizes the Likelihood.

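Taking advantage of ${}_sM_0(x)$, one may redefine the Extended sPlots by:

$$ N_n\, {}_sM_n(x) \delta x = c_n\, {}_sM_0(x) \delta x + \sum_{e \subset \delta x} {}_s\mathcal{P}_n = \sum_{e \subset \delta x} {}_{es}\mathcal{P}_n , \tag{82} $$

where the redefined sWeight which appears on the right hand side is given by:

$$ {}_{es}\mathcal{P}_n = {}_s\mathcal{P}_n + \frac{c_n}{N - \sum_{i,j} \mathbf{V}_{ij}}\, {}_s\mathcal{P}_0 . \tag{83} $$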

It does not rely on a priori knowledge of the true distribution $M_0(x)$. With this redefinition,
the following properties hold:

• The set of reconstructed $x$-distributions $N_i\, {}_sM_i$ of Eq. (82) completed by $(N - \sum_i N_i)\, {}_sM_0$
of Eq. (81) are such that they add up in each $x$-bin to the number of events observed.

• The normalization constant of the ${}_sM_0$ distribution vanishes quadratically with $N_0$.
It can be rewritten in the form:

$$ N - \sum_{i,j} \mathbf{V}_{ij} = N_0^2 \left( v_0 - \sum_{i,j} v_i \mathbf{V}_{ij} v_j \right) , \tag{84} $$

where $v_0$ is defined as $v_j$ (cf. Eq. (77)) and where the last term is regular when
$N_0 \to 0$.

• Whereas the normalization of the redefined extended sPlots remains correct, the
sum of the redefined sWeights Eq. (83) squared is no longer equal to $\mathbf{V}_{nn}$. Instead:

$$ \sum_{e=1}^{N} \left( {}_{es}\mathcal{P}_n \right)^2 = \mathbf{V}_{nn} + \frac{\left( \sum_{j} \mathbf{V}_{nj} v_j \right)^2}{v_0 - \sum_{i,j} v_i \mathbf{V}_{ij} v_j} . \tag{85} $$

Since the expression on the right hand side is regular when $N_0 \to 0$, it follows that
there is a price to pay to drop the knowledge of $M_0(x)$, even though one expects a
vanishing $N_0$. Technically, this feature stems from

$$ \sum_{e=1}^{N} \left( {}_s\mathcal{P}_0 \right)^2 = \sum_{e=1}^{N} {}_s\mathcal{P}_0 = N - \sum_{i,j} \mathbf{V}_{ij} . \tag{86} $$

Hence, the sum in quadrature of the ${}_sM_0$ uncertainties per bin diverges with $N_0 \to 0$.
This just expresses the obvious fact that no information can be extracted on species
'0' from a sample which contains no such events.
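The properties of the redefined sWeights can be verified on the same kind of binned toy sample, with the sum over events replaced by a count-weighted sum over $y$-slices. This is a hedged sketch: all inputs and the helper `fit_yields` are assumed for illustration, and the EM-type iteration merely stands in for a generic fit with $N_0$ held fixed.

```python
import numpy as np

def fit_yields(counts, F, n0, F0, n_iter=2000):
    """Fit the free yields with N_0 held fixed (hypothetical helper, EM-type update)."""
    yields = np.full(F.shape[0], (counts.sum() - n0) / F.shape[0])
    for _ in range(n_iter):
        denom = yields @ F + n0 * F0
        yields = yields * (F @ (counts / denom))
    return yields

# Illustrative binned inputs (assumed)
F = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.4, 0.3, 0.2, 0.1]])
F0 = np.array([0.4, 0.1, 0.1, 0.4])
counts = np.array([280.0, 265.0, 250.0, 305.0])
n0 = 100.0

yields = fit_yields(counts, F, n0, F0)
denom = yields @ F + n0 * F0
w = counts / denom**2
v = np.linalg.inv(np.einsum("a,ia,ja->ij", w, F, F))   # inverse of Eq. (71)
v_vec = F @ (w * F0)                                    # Eq. (77)
v0 = np.sum(w * F0**2)                                  # same definition, for species '0'
sweights = (v @ F) / denom                              # Eq. (70)
c = yields - v.sum(axis=1)                              # Eq. (72)

sp0 = 1.0 - sweights.sum(axis=0)                        # Eq. (80), one value per slice
norm0 = counts.sum() - v.sum()                          # N - sum_ij V_ij
bracket = v0 - v_vec @ v @ v_vec                        # bracket of Eq. (84)
es_weights = sweights + np.outer(c / norm0, sp0)        # Eq. (83)

print("N - sum V:", norm0, " vs N_0^2 * bracket:", n0**2 * bracket)
print("sum of squared redefined sWeights:", (es_weights**2) @ counts)
print("V_nn:", np.diag(v))
```

The sum of the squared redefined sWeights exceeds $\mathbf{V}_{nn}$ by the second term of Eq. (85), which is the statistical price mentioned above.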